-
Notifications
You must be signed in to change notification settings - Fork 603
[Feature] Add cleanup for terminated RayJob/RayCluster metrics #3923
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Hi @phantom5125 , thank you for creating the PR. Just want to make sure we are in the same page before you start polishing the PR. Do we really need the TTL-based cleanup? I was thinking cleaning up the metric as long as the CR is deleted. |
Thanks for your notice! From my perspective, the independent metricsTTL is primarily intended to address the scenario where JobTTLSeconds is set to 0. In this case, the RayJob CR is deleted immediately after the job finishes. Then metrics like |
I think introducing TTL-based cleanup is overkill for this scenario. Instead, we can simply document that setting JobTTLSeconds to a value smaller than the Prometheus scrape interval may cause metrics to be deleted before Prometheus can collect them. I just think we can start with a simpler implementation. What do you think? |
Ok, I will take your suggestion and update the PR soon! |
@troychiu PTAL, thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank you for the contribution!
ray-operator/controllers/ray/metrics/ray_cluster_metrics_test.go
Outdated
Show resolved
Hide resolved
ray-operator/controllers/ray/metrics/ray_cluster_metrics_test.go
Outdated
Show resolved
Hide resolved
ray-operator/controllers/ray/metrics/ray_cluster_metrics_test.go
Outdated
Show resolved
Hide resolved
ray-operator/test/support/metrics.go
Outdated
|
||
// CreateAndExecuteMetricsRequest is a test helper that creates an HTTP GET request to the /metrics endpoint, | ||
// executes it against a Prometheus handler using the provided registry, and returns the request, response recorder, and handler. | ||
func CreateAndExecuteMetricsRequest(t *testing.T, reg *prometheus.Registry) (*http.Request, *httptest.ResponseRecorder, http.Handler) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a bit hard for me to understand the usage of this helper function. It not only calls ServeHTTP to send the request but also returns the handler so that it can be used again. For a test case having multiple request sent, i think it's a bit confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#3923 (comment)
@win5923 What do you think? I can accept either way
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think having a helper function makes sense, but the functionality of the helper is a bit confusing. It would be clearer if it either only returns a handler that caller can reuse or it just sends the request for the caller. Let me know if you feel confused!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with Troy’s point. we can simply just sends the request and returns the response recorder.
WDYT?
// ExecuteMetricsRequest executes a GET request to /metrics and returns the response recorder
func ExecuteMetricsRequest(t *testing.T, handler http.Handler) (*http.Request, *httptest.ResponseRecorder) {
t.Helper()
req, err := http.NewRequestWithContext(context.Background(), http.MethodGet, "/metrics", nil)
require.NoError(t, err)
rr := httptest.NewRecorder()
handler.ServeHTTP(rr, req)
return rr
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@win5923 I just refactored in the latest commit with:
func GetMetricsResponseAndCode(t *testing.T, reg *prometheus.Registry) (string, int) {
t.Helper()
req, err := http.NewRequestWithContext(t.Context(), http.MethodGet, "/metrics", nil)
require.NoError(t, err)
rr := httptest.NewRecorder()
handler := promhttp.HandlerFor(reg, promhttp.HandlerOpts{})
handler.ServeHTTP(rr, req)
return rr.Body.String(), rr.Code
}
Does it look ok? I found it not really necessary to reuse the handler and we only care the response message & code in the test code? cc @troychiu
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I think this is better.
Let's just wait for Troy's comment. Thanks!
Why are these changes needed?
Since some of our metrics are permanently stored in Prometheus, that might cause the
/metrics
endpoint to become slow or time out, we need a lifecycle-based cleanup.Related issue number
Closes #3820
End-to-end test example
After we clean CR, there will be no more metrics
Checks